This is a case study to analyze data collected by a FitBit fitness tracker and make predictions from it. The data can be fetched from here.
After unzipping the files, it's clear that they are not arranged in any meaningful way. Let's arrange the data according to the timeline it represents, i.e., daily, hourly, and minute-level. The directory structure will look similar to this:
.
├── daily
│ ├── dailyActivity_merged.csv
│ ├── dailyCalories_merged.csv
│ ├── dailyIntensities_merged.csv
│ ├── dailySteps_merged.csv
│ └── sleepDay_merged.csv
├── heartrate_seconds_merged.csv
├── hourly
│ ├── hourlyCalories_merged.csv
│ ├── hourlyIntensities_merged.csv
│ └── hourlySteps_merged.csv
├── minutes
│ ├── minuteCaloriesNarrow_merged.csv
│ ├── minuteCaloriesWide_merged.csv
│ ├── minuteIntensitiesNarrow_merged.csv
│ ├── minuteIntensitiesWide_merged.csv
│ ├── minuteMETsNarrow_merged.csv
│ ├── minuteSleep_merged.csv
│ ├── minuteStepsNarrow_merged.csv
│ └── minuteStepsWide_merged.csv
└── weightLogInfo_merged.csv
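One way to get that layout is a short R sketch. It assumes the working directory is the unzipped data folder and that the file names match those listed above; adjust the patterns if yours differ.

```r
# Sketch: sort the unzipped CSVs into daily/, hourly/ and minutes/ folders.
# Assumes the working directory is the unzipped "Fitabase Data 4.12.16-5.12.16"
# folder and the file names match the tree above.
dir.create("daily"); dir.create("hourly"); dir.create("minutes")
move <- function(files, dest) file.rename(files, file.path(dest, files))
move(list.files(pattern = "^(daily|sleepDay).*csv$"), "daily")
move(list.files(pattern = "^hourly.*csv$"), "hourly")
move(list.files(pattern = "^minute.*csv$"), "minutes")
# heartrate_seconds_merged.csv and weightLogInfo_merged.csv stay at the top level.
```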
Before diving into the data, let's first install the required packages.
install.packages("readr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("hms")
install.packages("plotly")
install.packages("gridExtra")
Now let's load them into memory.
library(readr)
library(dplyr)
library(ggplot2)
library(hms)
library(plotly)
library(gridExtra)
Now that we have our tools, we are ready to dive in. We will start by importing all the files from the daily folder into our program.
dailyActivity_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailyActivity_merged.csv")
dailyCalories_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailyCalories_merged.csv")
dailyIntensities_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailyIntensities_merged.csv")
dailySteps_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/dailySteps_merged.csv")
Let's preview each file to make sure the data is in the right format and to spot any obvious NA values.
head(dailyActivity_merged)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
head(dailyCalories_merged)
## # A tibble: 6 × 3
## Id ActivityDay Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
head(dailyIntensities_merged)
## # A tibble: 6 × 10
## Id ActivityDay SedentaryMinutes LightlyActiveMinutes FairlyActiveMinu…
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 728 328 13
## 2 1503960366 4/13/2016 776 217 19
## 3 1503960366 4/14/2016 1218 181 11
## 4 1503960366 4/15/2016 726 209 34
## 5 1503960366 4/16/2016 773 221 10
## 6 1503960366 4/17/2016 539 164 20
## # … with 5 more variables: VeryActiveMinutes <dbl>,
## # SedentaryActiveDistance <dbl>, LightActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, VeryActiveDistance <dbl>
head(dailySteps_merged)
## # A tibble: 6 × 3
## Id ActivityDay StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
After checking each column we can confirm that our data is in the right format and there are no discrepancies.
Now we will check if there are any duplicate rows in the files. We will do this by defining a function which returns the number of duplicate rows.
count_duplicates <- function(dataframe){
  # number of rows minus number of unique rows = number of duplicate rows
  nrow(dataframe) - nrow(unique(dataframe))
}
Now let’s call this function for every file.
count_duplicates(dailyActivity_merged)
## [1] 0
count_duplicates(dailyCalories_merged)
## [1] 0
count_duplicates(dailyIntensities_merged)
## [1] 0
count_duplicates(dailySteps_merged)
## [1] 0
All rows are unique. Next, we will check for NA values in each of the files.
dailyActivity_merged %>% is.na() %>% which()
## integer(0)
dailyCalories_merged %>% is.na() %>% which()
## integer(0)
dailyIntensities_merged %>% is.na() %>% which()
## integer(0)
dailySteps_merged %>% is.na() %>% which()
## integer(0)
There are no NA values in any of the files and our cleaning process is done.
Before we dive into the Process phase, it's important that we get familiar with the data first. Let's check out the column names and try to find relations between the files.
We can use the colnames() function to see the column names of the files.
colnames(dailyActivity_merged)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(dailyCalories_merged)
## [1] "Id" "ActivityDay" "Calories"
colnames(dailyIntensities_merged)
## [1] "Id" "ActivityDay"
## [3] "SedentaryMinutes" "LightlyActiveMinutes"
## [5] "FairlyActiveMinutes" "VeryActiveMinutes"
## [7] "SedentaryActiveDistance" "LightActiveDistance"
## [9] "ModeratelyActiveDistance" "VeryActiveDistance"
colnames(dailySteps_merged)
## [1] "Id" "ActivityDay" "StepTotal"
Upon inspecting this data we find 3 things:
1. Every file shares the Id column and a date column (ActivityDate in dailyActivity_merged, ActivityDay in the others).
2. The steps column is named StepTotal in dailySteps_merged but TotalSteps in dailyActivity_merged.
3. Every column in dailyCalories_merged, dailyIntensities_merged and dailySteps_merged already appears in dailyActivity_merged.
The 3rd point of our observation implies that we can get rid of all the files except dailyActivity_merged.
rm(dailyCalories_merged)
rm(dailyIntensities_merged)
rm(dailySteps_merged)
In this phase we will get rid of redundant elements and rename some columns.
All these changes will be saved to a new dataframe called daily_activity.
options(width = 1500)
daily_activity <- dailyActivity_merged %>%
  transform(active_minutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes,
            ActivityDate = as.Date(ActivityDate, format = "%m/%d/%Y"),
            weekday = weekdays(as.Date(ActivityDate, format = "%m/%d/%Y"))) %>%
  rename(inactive_minutes = SedentaryMinutes, date = ActivityDate,
         total_steps = TotalSteps, total_distance = TotalDistance) %>%
  select(Id, date, weekday, total_steps, total_distance, active_minutes, inactive_minutes, Calories)
colnames(daily_activity) <- tolower(colnames(daily_activity))
Our new data frame looks like this:
## id date weekday total_steps total_distance active_minutes inactive_minutes calories
## 1 1503960366 2016-04-12 Tuesday 13162 8.50 366 728 1985
## 2 1503960366 2016-04-13 Wednesday 10735 6.97 257 776 1797
## 3 1503960366 2016-04-14 Thursday 10460 6.74 222 1218 1776
## 4 1503960366 2016-04-15 Friday 9762 6.28 272 726 1745
## 5 1503960366 2016-04-16 Saturday 12669 8.16 267 773 1863
## 6 1503960366 2016-04-17 Sunday 9705 6.48 222 539 1728
Let's create a scatter plot of total_steps against total_distance. We expect them to be directly proportional, as total_distance should increase linearly with total_steps.
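The scatter plot was built with ggplot2; a minimal sketch of the call, relying on the daily_activity data frame and the libraries loaded earlier (labels and theming are a matter of taste):

```r
# Scatter plot of total_steps vs total_distance from daily_activity.
ggplot(daily_activity, aes(x = total_steps, y = total_distance)) +
  geom_point(alpha = 0.5) +
  labs(title = "Total Steps vs Total Distance",
       x = "Total Steps", y = "Total Distance")
```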
As the plot clearly shows, our hypothesis was correct. This also means that total_distance is a redundant attribute and we can use total_steps alone. We can also check this correlation by using cor() like this:
cor(daily_activity$total_steps, daily_activity$total_distance)
## [1] 0.9853688
The 0.98 value signifies a strong positive linear relationship between total_steps and total_distance.
Now let's plot total_steps, active_minutes and inactive_minutes against calories. Our hypotheses are:
1. total_steps is directly proportional to calories
2. active_minutes is directly proportional to calories
3. inactive_minutes is inversely proportional to calories
To understand the relationship more easily let’s add a regression line to each plot.
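A sketch of one of the three plots; the other two only swap the x aesthetic. `geom_smooth(method = "lm")` draws the regression line and is what emits the formula messages below:

```r
# Scatter plot of total_steps vs calories with a linear regression line.
ggplot(daily_activity, aes(x = total_steps, y = calories)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +  # linear regression line
  labs(x = "Total Steps", y = "Calories")
```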
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Our hypotheses hold: the slopes for total_steps/calories and active_minutes/calories are positive, showing linear growth, while the slope for inactive_minutes/calories is negative. The relationships are not particularly strong, though, as there is a lot of variance in the data.
cor(daily_activity$total_steps, daily_activity$calories)
## [1] 0.5915681
cor(daily_activity$active_minutes, daily_activity$calories)
## [1] 0.4719975
cor(daily_activity$inactive_minutes, daily_activity$calories)
## [1] -0.106973
These values confirm that the relationships are not strong; a simple linear model would not fit the data well.
Now that we have plotted and analyzed the relationships in the raw data, let's find the mean values of the attributes through the week and analyze them.
order_days <- function(data){
  # order the weekday factor levels Monday through Sunday
  data$weekday <- factor(data$weekday,
                         levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                    "Friday", "Saturday", "Sunday"))
  return(data)
}
This function orders the days in the conventional Monday-to-Sunday order.
mean_daily <- daily_activity %>% group_by(weekday) %>% summarize(mean_active = mean(active_minutes), mean_inactive = mean(inactive_minutes), mean_steps = mean(total_steps), mean_calories = mean(calories))
mean_daily <- order_days(mean_daily)
head(mean_daily, 7)
## # A tibble: 7 × 5
## weekday mean_active mean_inactive mean_steps mean_calories
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Friday 236. 1000. 7448. 2332.
## 2 Monday 229. 1028. 7781. 2324.
## 3 Saturday 244. 964. 8153. 2355.
## 4 Sunday 208. 990. 6933. 2263
## 5 Thursday 217. 962. 7406. 2200.
## 6 Tuesday 235. 1007. 8125. 2356.
## 7 Wednesday 224. 989. 7559. 2303.
This is the summarized data; let's visualize it now.
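A sketch of how such bar charts might be drawn from mean_daily, using `grid.arrange()` from gridExtra (loaded earlier) to place them side by side; the column choices mirror the summary table above:

```r
# Bar charts of mean steps and mean active minutes per weekday.
p_steps <- ggplot(mean_daily, aes(x = weekday, y = mean_steps)) +
  geom_col() +
  labs(x = NULL, y = "Mean Steps")
p_active <- ggplot(mean_daily, aes(x = weekday, y = mean_active)) +
  geom_col() +
  labs(x = NULL, y = "Mean Active Minutes")
grid.arrange(p_steps, p_active, ncol = 2)
```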
By analyzing the graphs we can see that people are:
1. Most active on Saturday and Tuesday
2. Least active around Thursday and Sunday
We will also analyze sleep and heart-rate data, so let's import the sleep data first, from sleepDay_merged.csv.
sleepDay_merged <- read_csv("archive/Fitabase Data 4.12.16-5.12.16/daily/sleepDay_merged.csv")
We should first check for duplicate entries and, if there are any, remove them.
count_duplicates(sleepDay_merged)
## [1] 3
sleepDay_merged <- unique(sleepDay_merged)
count_duplicates(sleepDay_merged)
## [1] 0
Now that we have eliminated duplicates, let’s check for NA values
sleepDay_merged %>% is.na() %>% which()
## integer(0)
head(sleepDay_merged)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
There are no NA values, so we can move on to cleaning the data frame. As we can see above, the time component of SleepDay is constant, so we will only consider the date. We will convert SleepDay to date format, calculate the weekday, and rename a few columns.
sleep_minutes <- sleepDay_merged %>%
  rename(id = Id, date = SleepDay, count = TotalSleepRecords,
         sleep_time = TotalMinutesAsleep, bed_time = TotalTimeInBed) %>%
  transform(date = as.Date(date, "%m/%d/%Y %I:%M:%S %p")) %>%
  transform(weekday = weekdays.Date(date)) %>%
  select(id, date, weekday, count, sleep_time, bed_time)
Let's plot some graphs to analyze our data. We will plot the following relations:
1. sleep_time vs bed_time
2. Mean sleep_time per weekday
3. Mean count per weekday
Let’s plot sleep_time VS bed_time
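A minimal sketch of that plot, assuming the sleep_minutes data frame built above:

```r
# Scatter plot of time in bed vs time asleep from sleep_minutes.
ggplot(sleep_minutes, aes(x = bed_time, y = sleep_time)) +
  geom_point(alpha = 0.5) +
  labs(x = "Time in Bed (minutes)", y = "Time Asleep (minutes)")
```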
cor(sleep_minutes$sleep_time, sleep_minutes$bed_time)
## [1] 0.9304224
The data is highly linear and there is a strong linear relationship between these attributes; we will fit a regression model to this later on.
Now we will calculate a summary table to find means.
sleep_mean <- sleep_minutes %>% group_by(weekday) %>% summarize(mean_sleep_time = mean(sleep_time), mean_count = mean(count))
sleep_mean <- order_days(sleep_mean)
head(sleep_mean, 7)
## # A tibble: 7 × 3
## weekday mean_sleep_time mean_count
## <fct> <dbl> <dbl>
## 1 Friday 405. 1.07
## 2 Monday 420. 1.11
## 3 Saturday 419. 1.19
## 4 Sunday 453. 1.18
## 5 Thursday 401. 1.03
## 6 Tuesday 405. 1.11
## 7 Wednesday 435. 1.15
Let’s plot mean_sleep_time per weekday.
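A sketch of that bar chart, drawn from the sleep_mean summary table above:

```r
# Bar chart of mean sleep time per weekday from sleep_mean.
ggplot(sleep_mean, aes(x = weekday, y = mean_sleep_time)) +
  geom_col() +
  labs(x = NULL, y = "Mean Sleep Time (minutes)")
```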
We can conclude that people sleep:
1. Most on Wednesday and the weekend (Saturday and Sunday)
2. Least on Thursday and Tuesday
We don't really need to plot mean_count; we can just sort the table in descending order of mean_count.
head(sleep_mean %>% select(weekday, mean_count) %>% arrange(desc(mean_count)), 7)
## # A tibble: 7 × 2
## weekday mean_count
## <fct> <dbl>
## 1 Saturday 1.19
## 2 Sunday 1.18
## 3 Wednesday 1.15
## 4 Monday 1.11
## 5 Tuesday 1.11
## 6 Friday 1.07
## 7 Thursday 1.03
We can see that people take:
1. The most naps on Saturday, Sunday and Wednesday
2. The fewest naps on Thursday, Friday and Tuesday
This matches the conclusion we drew from the previous plot.